Method for performing a search of a plurality of documents for similarity to a query.

(57) A method for performing a search of a plurality
of documents for similarity to at least one query word
includes retrieving a first document (20), and
determining (21,23) a number of occurrences of the at
least one query word in the first document. Then, a
next document is retrieved (25) and a number of
occurrences of the at least one query word in the next
document is determined (27,28). The steps are repeated
(30) until each of the plurality of documents has been
retrieved, and the number of occurrences of the at
least one query word has been determined in each of
the plurality of documents. The query word can include
a plurality of query words, all of which are searched
in each document, in turn, rather than being searched
word by word in the whole collection of documents. The
documents are then ranked according to the number of
occurrences of the query words determined in each
document, and a list of documents is produced
according to the document ranking.



This invention relates to improvements in text and
image processing methods and techniques, and more
particularly to improvements in methods for word or
term identification and location in document images,
and still more particularly to improvements in methods
for computer searching a number of document images for
existence of query words or terms with reduced memory
requirements.

There has been increasingly widespread interest 
in document processing, both in electronic and in pa- 
per document forms. Often it is desired to locate par- 
ticular search terms within a large corpus of docu- 
ments; for example, in performing research to locate 
papers or publications that pertain to particular sub- 
jects, in finding particular testimony in deposition or 
discovery documents that contain particular words or 
phrases, in locating relevant court decisions in a legal 
database that have certain key words, and in mani- 
fold other instances. 

Sometimes the documents are presented in electronic
form in which the document text and images have been
encoded in an electronic memory medium from which the
documents can be retrieved for perusal or for "hard
copy" or paper reproduction. In the past, when a large
number of such documents was to be searched to locate
one or more query terms, usually words, an index was
built against which the query terms were compared.
Such an index generally is formed of two parts. The
first part is a document identifier (herein the
"document id"). The document id is merely an
identification of each document in the collection, and
may be a number, key word or phrase, or other unique
identifier. The second part is a word and the number
of times the word appears in the document with which
it is identified (herein the "word frequency").
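
As an illustration only, the two-part index entry described above can be sketched in Python. The function name `build_index_entries`, the whitespace tokenization and the sample documents are assumptions made for this sketch; they do not appear in the original disclosure.

```python
from collections import Counter


def build_index_entries(documents):
    """Return a list of (document id, word, word frequency) entries.

    `documents` maps a document id to its text. Lower-casing and
    splitting on whitespace are simplifying assumptions of this sketch.
    """
    entries = []
    for doc_id in sorted(documents):
        counts = Counter(documents[doc_id].lower().split())
        for word in sorted(counts):
            # First part: the document id. Second part: the word and
            # the number of times it appears in that document.
            entries.append((doc_id, word, counts[word]))
    return entries


docs = {1: "the cat sat on the mat", 2: "the dog"}
entries = build_index_entries(docs)
```

Each entry pairs a document id with one word and its word frequency, so a collection yields one entry per distinct word per document.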

In the past, as shown in Figure 1, to identify the
particular documents in which search or query words
exist, usually the index of all of the words is
brought into a computer memory 10, and the query words
are compared, one at a time, against each of the words
in the memory. As each word is compared, a "score" is
kept of the documents in which it appears. Thus, a
first query word is processed 11, and a partial
"score" is computed 13 for the first word. Then a next
query word is processed 14, and a cumulative "score"
is computed 16. The successive query words are
processed, and the cumulative scores updated, until
all query words have been completed 17. After the last
query word has been searched, the "scores" can be used
to identify or sort the documents 18 in order of the
number of "hits" by the query words, and a list of
documents found can be displayed 19.
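
The prior-art flow of Figure 1 can be sketched roughly as follows. This is a minimal illustration, assuming that the "score" is simply the total number of hits; the text does not fix a particular formula, and the name `score_term_at_a_time` and the sample index are hypothetical.

```python
from collections import defaultdict


def score_term_at_a_time(index, query_words):
    """Prior-art style scoring: one full pass over the index per query word.

    `index` is a list of (doc_id, word, frequency) entries held entirely
    in memory. The score is the total hit count (an assumption).
    """
    scores = defaultdict(int)           # one accumulator per document seen
    for query_word in query_words:      # query words are taken one at a time
        for doc_id, word, freq in index:
            if word == query_word:
                scores[doc_id] += freq  # cumulative score across query words
    # Sort by descending score; ties are broken by ascending document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = [(1, "cat", 2), (1, "dog", 1), (2, "dog", 3), (3, "bird", 1)]
ranking = score_term_at_a_time(index, ["cat", "dog"])
```

Note that the accumulator table `scores` must stay available for every document until the last query word has been processed, which is the memory cost discussed next.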

Such techniques, however, require a large amount of
computer accessible memory, particularly for large
document collections. The memory requirement often
makes such techniques impractical for document
searching on personal or portable computers, even if
the documents are stored on large capacity memory
disks, and generally requires large, mainframe
computers with associated large memories.

In the field of image processing, recently, direct
paper document searching techniques have been proposed
in which one or more morphological properties of the
images on the document are processed and used for
comparison against a query word, term or image. In
accordance with such techniques, a document is scanned
and the morphological properties of its various images
directly determined without decoding the content of
the image. In performing searches of a large corpus of
documents, however, one technique that can be used is
to generate an index similar to that described above,
but with a list of frequencies of morphological
properties used in place of the words. Again,
especially in large document collections, a large
amount of memory is required to perform search
queries.

In light of the above, it is, therefore, an object of
the invention to provide an improved method for
performing a similarity search on a large collection
of documents using less memory than conventional
methods heretofore employed.

It is another object of the invention to provide an
improved method of the type described that can be
performed efficiently.

The present invention provides a method of per- 
forming a search of a plurality of documents, accord- 
ing to claims 1, 2, 4, 5 and 6 of the appended claims. 

In accordance with a broad aspect of the invention, a
method for performing a search of a plurality of
documents for similarity to a query term or word is
presented. The method includes retrieving a first
document, and determining a number of occurrences of
the query word in the first document. The method then
includes retrieving a next document and determining a
number of occurrences of the query word in the next
document. The steps are repeated until each of the
plurality of documents has been retrieved, and the
number of occurrences of the query word has been
determined in each of the plurality of documents.

The query word can include a plurality of query terms,
all of which are searched in each document in turn,
rather than being searched term by term in the whole
collection of documents. The documents are then ranked
according to the number of occurrences of the query
words determined in each document, and a list of
documents is produced according to the document
ranking.

In one embodiment, a list of words contained 
within the retrieved document is generated, and the 
query words are compared to the generated list of 
words. 

In another embodiment, all of the query words are
compared against a first portion of the documents.
Subsequently, all of the query words are compared
against a second portion of the documents. The
documents are then ranked, according to the number of
occurrences of the query words determined in each
document, and a list of the documents is generated
according to the document ranking.

In another embodiment, the documents are organized
into an inverted index. In this case, instead of
retrieving a document, the segment of a list of
document-id and term-frequency pairs related to the
query term and the document is examined.

The present invention further provides a pro- 
grammable document searching system when suit- 
ably programmed for carrying out the method of any 
of claims 1 to 10. 

The invention is illustrated in the accompanying 
drawing, in which: 

Figure 1 is a block diagram outlining the steps for 
performing a similarity search of a corpus of 
documents, in accordance with the prior art; and 
Figure 2 is a block diagram outlining the steps for 
performing a similarity search of a corpus of 
documents in accordance with a preferred em- 
bodiment of the invention. 
The present invention relates to a method for per- 
forming a search of a plurality of documents, which 
method may be carried out in any conventional infor- 
mation processing system, such as that schematical- 
ly illustrated in, and described with reference to, Fig. 
1 of European patent application 93306281.2, a copy 
of which was filed with the present application. 

This invention relates to techniques for performing
similarity searches of the type in which the
similarity search is performed with a query formed of
a sequence of one or more words, syllables, phrases,
images, or the like. Although the term "query word" is
used herein, it should be understood that the "word"
refers to a word, a word portion, or portions of a
document or image which comprise letters, numbers, or
other language symbols including non-alphabetic
linguistic characters such as ideograms or foreign
syllabaries, and word or character substitutes, such
as "wildcard" characters or the like. The result of
the similarity search is a ranked list of documents
from the indexed collection that have the highest
similarity quotient to the query. The similarity
quotient of a document with regard to a query is a
number that results from a user defined formula that
may include the number of documents in which each
query word appears, the number of times it appears in
each document and the number of documents in the
corpus. In some instances, it may be desirable to
include different weights to be applied that designate
relative importance of query words, or order of
appearance of query words, or other similar search
criteria.
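
Because the similarity quotient is left to a user defined formula, the following is only one possible sketch of such a formula. It assumes a weighting in which each occurrence counts for more when the word appears in fewer documents of the corpus; this particular formula, the function name `similarity_quotient` and its parameters are assumptions and are not taken from the original text.

```python
import math


def similarity_quotient(term_freqs, doc_freqs, corpus_size, weights=None):
    """Hypothetical similarity quotient for one document.

    term_freqs:  query word -> occurrences in this document
    doc_freqs:   query word -> number of documents containing the word
    corpus_size: number of documents in the corpus
    weights:     optional query word -> relative importance (default 1.0)
    """
    weights = weights or {}
    total = 0.0
    for word, tf in term_freqs.items():
        df = doc_freqs.get(word, 0)
        if tf == 0 or df == 0:
            continue  # word absent from this document or from the corpus
        # Words found in fewer documents contribute more per occurrence.
        total += weights.get(word, 1.0) * tf * math.log(corpus_size / df)
    return total


q = similarity_quotient({"cat": 2, "dog": 1}, {"cat": 1, "dog": 4}, 4)
```

In this sketch a word that appears in every document contributes nothing, so `q` above is determined by "cat" alone. Any other formula over the same three inputs could be substituted without changing the search method itself.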

In order to accomplish the similarity search according
to the invention, an inverted index is preferably
used. The inverted index contains a list of pairs of
document identifiers and word frequency for each
unique word in the corpus, or collection of documents.
The word frequency is the number of times the word
appears in the document identified by the document id
with which it is paired. The document id - word
frequency pairs are preferably arranged in ascending
or descending order by document id.
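
A minimal sketch of such an inverted index is given below, assuming ascending order by document id and whitespace tokenization. The name `build_inverted_index` and the sample data are illustrative assumptions.

```python
from collections import Counter, defaultdict


def build_inverted_index(documents):
    """Map each unique word to a list of (doc_id, frequency) pairs.

    Each list is kept in ascending document id order. Lower-casing and
    whitespace tokenization are assumptions made for this sketch.
    """
    inverted = defaultdict(list)
    for doc_id in sorted(documents):  # ascending ids keep every list ordered
        counts = Counter(documents[doc_id].lower().split())
        for word, freq in counts.items():
            inverted[word].append((doc_id, freq))
    return dict(inverted)


docs = {2: "dog dog cat", 1: "cat bird", 3: "dog"}
inverted = build_inverted_index(docs)
```

Visiting the documents in sorted id order is what keeps every per-word list ordered, which the traversal described below relies on.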

The method of the invention is in contrast to previous
methods in which the calculation of a similarity
quotient is usually made by going entirely through the
list of pairs of document identifiers and word
frequencies for a single query word, and as each query
word is being processed, computing partial scores for
each document found in the list. In accordance with
the method of a preferred embodiment of the invention,
with reference now to Figure 2, rather than accessing
all the document id - word frequency pairs for a query
word before accessing those of another query word, the
comparison is switched from one stream of document id
- word frequency pairs to another. Thus, all the
document id - word frequency pairs for one document
are visited before going on to others.
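
The switching between streams described above can be sketched as follows. This is an illustration under stated assumptions: each query word's list is sorted by ascending document id, the score is the total occurrence count, and the name `score_document_at_a_time` is hypothetical.

```python
def score_document_at_a_time(postings, query_words):
    """Visit documents in ascending id order, finishing each one in turn.

    `postings` maps a word to a list of (doc_id, frequency) pairs sorted
    by ascending doc_id. The score is the total occurrence count, which
    is an assumption; any per-document formula could be substituted.
    """
    lists = [postings.get(word, []) for word in query_words]
    cursors = [0] * len(lists)           # one position per query word list
    scores = []
    while True:
        # Document id at the head of each list that is not yet exhausted.
        heads = [lst[pos][0] for lst, pos in zip(lists, cursors) if pos < len(lst)]
        if not heads:
            break
        doc_id = min(heads)              # lowest id not yet scored
        score = 0
        for i, lst in enumerate(lists):
            pos = cursors[i]
            if pos < len(lst) and lst[pos][0] == doc_id:
                score += lst[pos][1]
                cursors[i] = pos + 1     # advance only the lists that matched
        scores.append((doc_id, score))   # this document's score is complete
    return scores


postings = {"cat": [(1, 2), (3, 1)], "dog": [(1, 1), (2, 3)]}
result = score_document_at_a_time(postings, ["cat", "dog"])
```

At any moment the loop looks at only the current pair of each query word's list, so a document's score is final as soon as it is appended.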

Accordingly, the document id - word frequency 
pairs for the first document are retrieved 20 into a 
computer memory. Thus, it will be appreciated that 
the technique of the invention is particularly well suit- 
ed for use in memory constrained cases, and is ana- 
logous to an n-way merge algorithm, though in this 
case a merge is not being performed, but rather, a set 
of calculations is being done. 

Next, all of the query words are compared, searched,
or processed vis-a-vis the first document 21, and a
complete document "score" is computed for the first
document 23. In performing a similarity search in
accordance with the invention, it is desirable to keep
a list of all the documents in the collection, or at
least a list of all the documents that have been seen
in the lists being processed. This is desirable in
order to track the partial score of the documents.
This list can be accessed at points corresponding to
the document id portions of the document id - word
frequency pairs being processed. Thus, as the list for
each query word is processed, the document list may be
accessed at increasing (or decreasing) points,
depending upon the ordering of the document ids.

The process is continued by retrieving the next
document id - word frequency pairs for the next
document 25 into the computer memory, and again
processing all of the query words vis-a-vis the next
document 27, and a new "score" is computed for the
next document 28. The process is continued 30 until
all of the documents have been processed. Once all the
query words have been processed, the fully computed or
cumulative "scores" are sorted into rank order and the
list displayed 31. Alternatively, in order to produce
a sorted list immediately at the end of the process,
each time a partial score is computed, the changed
document score can be repositioned in the ranking as
necessary.
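
One way to keep the ranking sorted as scores arrive, rather than sorting at the end, is sketched below. It simplifies the alternative described above by inserting each document once its score is complete; the helper name `insert_ranked` and the sample scores are assumptions.

```python
import bisect


def insert_ranked(ranking, doc_id, score):
    """Insert (doc_id, score) so that `ranking` stays sorted best-first.

    Entries are stored as (-score, doc_id) so that ascending order of the
    stored tuples corresponds to descending score.
    """
    bisect.insort(ranking, (-score, doc_id))


ranking = []
for doc_id, score in [(1, 3), (2, 5), (3, 1), (4, 5)]:
    insert_ranked(ranking, doc_id, score)  # called as each score is completed

# Convert back to (doc_id, score) pairs in display order.
ordered = [(doc_id, -neg) for neg, doc_id in ranking]
```

With this arrangement the list is ready for display as soon as the last document has been processed.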
Again, this is in contrast to previous techniques in
which, if there was not sufficient memory in the
system to keep the whole document list in memory
together with the portion of the query word's list of
document id - word frequency pairs being processed,
most of the document list was paged in from an
external store for comparison with each query word. In
the technique of the invention, processing is switched
from one query word's stream of pairs to another to
make all the calculations, for example, for the lowest
document id in all of the various lists before going
on to the second lowest document id, and so forth. In
accordance with the invention, there need only be
enough memory to contain the entry for one document in
the document list at a time, and for each query word,
one entry in the list of pairs of document id - word
frequency. Since for large document collections the
list of documents will be very large, this enables
computation with a much smaller memory requirement
than previous techniques.
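
The memory behaviour described above can be illustrated with lazy streams. The sketch below assumes each query word's pairs arrive as an iterator in ascending document id order (for example, read sequentially from disk) and uses the standard library's `heapq.merge`, which holds roughly one pending pair per stream. The names `stream_scores` and `pairs_for` are hypothetical, and the score is again assumed to be a simple occurrence count.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def stream_scores(streams):
    """Yield (doc_id, score) lazily from several sorted pair streams.

    Each stream yields (doc_id, frequency) pairs in ascending doc_id
    order. Only the pending pair of each stream is held at a time.
    """
    merged = heapq.merge(*streams, key=itemgetter(0))
    for doc_id, pairs in groupby(merged, key=itemgetter(0)):
        # All pairs for this document id are consumed before moving on.
        yield doc_id, sum(freq for _, freq in pairs)


def pairs_for(word_pairs):
    # Stand-in for a sequential read of one query word's list.
    yield from word_pairs


cat = pairs_for([(1, 2), (3, 1)])
dog = pairs_for([(1, 1), (2, 3)])
scores = list(stream_scores([cat, dog]))
```

Although a merge routine is used here for convenience, the pairs are only grouped and summed per document, in line with the remark that a set of calculations rather than a true merge is being performed.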

It will be noted that it may be necessary to perform
more computation than in previous techniques in making
comparisons between the identifications of the current
element in the various query word lists. However, this
computation is inexpensive compared to disk
input/output costs.

An alternative embodiment of the method of the 
invention is to process some number of documents 
(more than one) at the same time. This number of 
documents could be determined at run time based on 
the available memory, or at compile time based on 
the expected target machine. Each list of document id 
- word frequency pairs would be processed until the 
document identifications exceeded the current range 
of document identifications being processed. Then 
computation would move on to the next query word 
list. This variation decreases the amount of extra 
computation done, although it does not eliminate it 
entirely, and requires more memory, though not as 
much as previous approaches. 
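
The range-based variation can be sketched as follows, assuming integer document ids in ascending order and a fixed `batch_size` standing in for the number of documents chosen from the available memory. The function name `score_in_batches` and the occurrence-count score are assumptions made for illustration.

```python
def score_in_batches(postings, query_words, batch_size):
    """Score documents one range of ids at a time.

    `postings` maps a word to (doc_id, frequency) pairs in ascending
    doc_id order; ids are assumed to be integers. `batch_size` is the
    width of the id range processed together.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    lists = [postings.get(word, []) for word in query_words]
    cursors = [0] * len(lists)
    results = []
    while any(pos < len(lst) for lst, pos in zip(lists, cursors)):
        # The next range starts at the lowest unprocessed document id.
        low = min(lst[pos][0] for lst, pos in zip(lists, cursors) if pos < len(lst))
        high = low + batch_size           # current range is [low, high)
        partial = {}                      # scores for this range only
        for i, lst in enumerate(lists):   # one query word list at a time
            pos = cursors[i]
            while pos < len(lst) and lst[pos][0] < high:
                doc_id, freq = lst[pos]
                partial[doc_id] = partial.get(doc_id, 0) + freq
                pos += 1
            cursors[i] = pos              # resume here for the next range
        results.extend(sorted(partial.items()))
    return results


postings = {"cat": [(1, 2), (3, 1), (7, 4)], "dog": [(1, 1), (2, 3), (8, 1)]}
batched = score_in_batches(postings, ["cat", "dog"], batch_size=3)
```

Here the table `partial` never holds more than `batch_size` documents, and each list is read up to the end of the current range before computation moves on to the next query word's list.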



Claims 

1. A method for performing a search of a plurality of 
documents for similarity to a query, comprising: 

(a) retrieving a first document; 

(b) determining a number of occurrences of 
said query in said first document; 

(c) retrieving a next document; 

(d) determining a number of occurrences of 
said query in said next document; 

(e) repeating steps (c) and (d) until each of 
said plurality of documents have been re- 
trieved, and the number of occurrences of 
said query has been determined in each of 
said plurality of documents. 



2. A method for performing a search of a plurality of
documents for similarity to a plurality of query
words, comprising:

(a) retrieving a first document;

(b) determining a number of occurrences of
each of said plurality of query words in said
first document;

(c) retrieving a next document;

(d) determining a number of occurrences of
each of said query words in said next document;

(e) repeating steps (c) and (d) until each of
said plurality of documents have been retrieved,
and the number of occurrences of each of said
plurality of query words has been determined in
each of said plurality of documents.

3. The method of claim 1 or 2 wherein said query
comprises a number of query words.

4. A method for performing a search of a plurality of
documents for similarity to a plurality of query
words, comprising:

(a) retrieving each of said documents in turn;

(b) determining a number of occurrences of
each of said plurality of query words in each
of said documents in turn when each of said
documents is retrieved.

5. A method for performing a search of a plurality of
documents for similarity to a plurality of query
words, comprising:

(a) retrieving a first portion of said plurality of
documents;

(b) determining a number of occurrences of
each of said plurality of query words in each
document in said first portion of said plurality
of documents;

(c) retrieving a second portion of said plurality
of documents;

(d) determining a number of occurrences of
each of said plurality of query words in each
document in said second portion of said
plurality of documents.

6. A method for performing a search of a plurality of
documents for similarity to a plurality of query
words, comprising:

generating an index of entries for all words
of all of said documents, each of said documents
being identified by a document identifier, each
entry containing a document identifier and a
number of occurrences that a word appears in the
identified document;

for each document identifier, in turn, comparing
each of said plurality of query words to
each word coupled with each document identifier.

7. The method of any of claims 3 to 6 further
comprising ranking the documents according to the
number of occurrences of said query words
determined in each document.

8. The method of claim 7 further comprising produc- 
ing a list of documents according to the document 
ranking. 

9. The method of any of the preceding claims 
wherein said steps of retrieving a document com- 
prises retrieving an image of the retrieved docu- 
ment into an electronic memory. 

10. The method of any of the preceding claims 
wherein said step of determining a number of oc- 
currences of each of said query words comprises 
generating a list of words contained within the re- 
trieved document, and comparing the query 
words against the generated list of words. 
Figure 1 (prior art)

(1) RETRIEVE DOCUMENT ID - WORD FREQ. PAIRS FOR ALL DOCUMENTS
(2) PROCESS ALL DOCUMENTS VIS-A-VIS A FIRST QUERY WORD
(3) COMPUTE PARTIAL "SCORE" FOR FIRST QUERY WORD
(4) PROCESS ALL DOCUMENTS FOR NEXT QUERY WORD
(5) COMPUTE CUMULATIVE "SCORE" WITH NEXT QUERY WORD RESULTS
(6) REPEAT STEPS (4) & (5) UNTIL ALL QUERY WORDS PROCESSED
(7) SORT DOCUMENTS BY SCORES

Figure 2

(1) RETRIEVE DOCUMENT ID - WORD FREQ. PAIRS FOR FIRST DOCUMENT
(2) PROCESS ALL QUERY WORDS VIS-A-VIS FIRST DOCUMENT
(3) COMPUTE COMPLETE DOCUMENT "SCORE" FOR FIRST DOCUMENT
(4) RETRIEVE DOCUMENT ID - WORD FREQ. PAIRS FOR NEXT DOCUMENT
(5) PROCESS ALL QUERY WORDS VIS-A-VIS NEXT DOCUMENT
(6) COMPUTE COMPLETE DOCUMENT "SCORE" FOR NEXT DOCUMENT
(7) REPEAT STEPS (4)-(6) UNTIL ALL DOCUMENTS PROCESSED
(8) DISPLAY DOCUMENT LISTS BY "SCORES"